Structured Generative AI: How to constrain your model to output defined formats
by Oren Matar · April 18, 2024

In this post I will explain and demonstrate the concept of “structured generative AI”: generative AI constrained to defined formats. By the end of the post, you will understand where and when it can be used and how to implement it whether you’re crafting a transformer model from scratch or utilizing Hugging Face’s models. Additionally, we will cover an important tip for tokenization that is especially relevant for structured languages.

One of the many uses of generative AI is as a translation tool. This often involves translating between two human languages but can also include computer languages or formats. For example, your application may need to translate natural (human) language to SQL:

Natural language: “Get customer names and emails of customers from the US”
SQL: "SELECT name, email FROM customers WHERE country = 'USA'"

Or to convert text data into a JSON format:

Natural language: “I am John Doe, phone number is 555-123-4567, my friends are Anna and Sara”
JSON: {name: "John Doe", phone_number: "555-123-4567", friends: {name: [["Anna", "Sara"]]}}

Naturally, many more applications are possible for other structured languages. The training process for such tasks involves feeding examples of natural language alongside structured formats to an encoder-decoder model. Alternatively, a pre-trained large language model (LLM) can suffice.

While achieving 100% accuracy is unattainable, there is one class of errors that we can eliminate: syntax errors. These are violations of the format of the language, like replacing commas with dots, using table names that are not present in the SQL schema, or omitting bracket closures, which render SQL or JSON non-executable.

The fact that we're translating into a structured language means that the list of legitimate tokens at every generation step is limited and pre-determined. If we can inject this knowledge into the generation process, we can avoid a wide range of incorrect results. This is the idea behind structured generative AI: constrain the generation to a list of legitimate tokens.

A quick reminder on how tokens are generated

Whether employing an encoder-decoder or GPT architecture, token generation operates sequentially. Each token's selection relies on both the input and the previously generated tokens, continuing until an end-of-sequence token (such as </s>) is generated, signifying the completion of the sequence. At each step, a classifier assigns a logit value to every token in the vocabulary; after a softmax, these logits represent the probability of each token being the next selection. The next token is then sampled based on those logits.

The decoder classifier assigns a logit to every token in the vocabulary (Image by author)
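
As a minimal sketch (my own, not the article's code), a single decoding step boils down to turning the logits into probabilities and picking one token from them:

import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor) -> int:
    # `logits` holds one score per vocabulary entry for the current step;
    # softmax turns the scores into probabilities, and we draw one token.
    probs = F.softmax(logits, dim=-1)                  # shape: (vocab_size,)
    next_id = torch.multinomial(probs, num_samples=1)  # sample a single token ID
    return int(next_id)

In practice, libraries layer temperature, top-k/top-p filtering, or greedy argmax on top of this step, but the logits are always the input to the selection.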

Limiting token generation

To constrain token generation, we incorporate knowledge of the output language's structure. Illegitimate tokens have their logits set to -inf, ensuring their exclusion from selection. For instance, if only a comma or "FROM" is valid after "SELECT name", all other token logits are set to -inf.

If you’re using Hugging Face, this can be implemented using a “logits processor”. To use it you need to implement a class with a __call__ method, which will be called after the logits are calculated, but before the sampling. This method receives all token logits and generated input IDs, returning modified logits for all tokens.

The logits returned from the logits processor: all illegitimate tokens get a value of -inf (Image by author)

I'll demonstrate the code with a simplified example. First, we initialize the model. We will use BART in this case, but this works with any model.

from transformers import BartForConditionalGeneration, BartTokenizerFast, PreTrainedTokenizer
from transformers.generation.logits_process import LogitsProcessorList, LogitsProcessor
import torch

name = 'facebook/bart-large'
tokenizer = BartTokenizerFast.from_pretrained(name, add_prefix_space=True)
pretrained_model = BartForConditionalGeneration.from_pretrained(name)

If we want to generate a translation from the natural language to SQL, we can run:

to_translate = 'customers emails from the us'
words = to_translate.split()
tokenized_text = tokenizer([words], is_split_into_words=True)
out = pretrained_model.generate(torch.tensor(tokenized_text["input_ids"]), max_new_tokens=20)
print(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(out[0], skip_special_tokens=True)))

Returning ‘More emails from the us’

Since we did not fine-tune the model for text-to-SQL tasks, the output does not resemble SQL. We will not train the model in this tutorial, but we will guide it to generate an SQL query. We will achieve this by employing a function that maps each generated token to a list of permissible next tokens. For simplicity, we'll focus only on the immediately preceding token, but more complicated mechanisms are easy to implement. We will use a dictionary defining, for each token, which tokens are allowed to follow it. For example, the query must begin with "SELECT" or "DELETE", and after "SELECT" only "name", "email", or "id" are allowed, since those are the columns in our schema.

rules = {'<s>': ['SELECT', 'DELETE'],  # beginning of the generation
         'SELECT': ['name', 'email', 'id'],  # names of columns in our schema
         'DELETE': ['name', 'email', 'id'],
         'name': [',', 'FROM'],
         'email': [',', 'FROM'],
         'id': [',', 'FROM'],
         ',': ['name', 'email', 'id'],
         'FROM': ['customers', 'vendors'],  # names of tables in our schema
         'customers': ['</s>'],
         'vendors': ['</s>']}  # end of the generation

Now we need to convert these tokens to the IDs used by the model. This will happen inside a class inheriting from LogitsProcessor.

def convert_token_to_id(token):
    return tokenizer(token, add_special_tokens=False)['input_ids'][0]


class SQLLogitsProcessor(LogitsProcessor):
    def __init__(self, tokenizer: PreTrainedTokenizer):
        self.tokenizer = tokenizer
        self.rules = {convert_token_to_id(k): [convert_token_to_id(v0) for v0 in v] for k, v in rules.items()}

Finally, we will implement the __call__ function, which is called after the logits are calculated. The function creates a new tensor of -infs, checks which IDs are legitimate according to the rules (the dictionary), and places their scores in the new tensor. The result is a tensor that only has valid values for the valid tokens.

class SQLLogitsProcessor(LogitsProcessor):
    def __init__(self, tokenizer: PreTrainedTokenizer):
        self.tokenizer = tokenizer
        self.rules = {convert_token_to_id(k): [convert_token_to_id(v0) for v0 in v] for k, v in rules.items()}

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        # until the <s> token has been generated, leave the scores untouched
        if not (input_ids == self.tokenizer.bos_token_id).any():
            return scores
        # start from -inf everywhere, then copy back the scores of the allowed tokens
        new_scores = torch.full((1, self.tokenizer.vocab_size), float('-inf'))
        legit_ids = self.rules[int(input_ids[0, -1])]
        new_scores[:, legit_ids] = scores[0, legit_ids]
        return new_scores

And that’s it! We can now run a generation with the logits-processor:

to_translate = 'customers emails from the us'
words = to_translate.split()
tokenized_text = tokenizer([words], is_split_into_words=True, return_offsets_mapping=True)
logits_processor = LogitsProcessorList([SQLLogitsProcessor(tokenizer)])
out = pretrained_model.generate(torch.tensor(tokenized_text["input_ids"]), max_new_tokens=20, logits_processor=logits_processor)
print(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(out[0], skip_special_tokens=True)))

Returning SELECT email , email , id , email FROM customers

The outcome is a little strange, but remember: we didn’t even train the model! We only enforced token generation based on specific rules. Notably, constraining generation doesn’t interfere with training; constraints only apply during generation post-training. Thus, when appropriately implemented, these constraints can only enhance generation accuracy.

Our simplistic implementation falls short of covering all of the SQL syntax. A real implementation must support more of the syntax, potentially consider not just the last token but several, and enable batch generation. Once these enhancements are in place, our trained model can reliably generate executable SQL queries, constrained to valid table and column names from the schema. A similar approach can enforce constraints when generating JSON, ensuring key presence and bracket closure.
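
As a hedged sketch of one possible extension (names and structure are my own, not the article's), a logits processor could key the rules on the last two generated token IDs and mask each sequence in a batch independently:

import torch
from transformers.generation.logits_process import LogitsProcessor

class BatchContextLogitsProcessor(LogitsProcessor):
    # Illustrative only: rules maps (prev_prev_id, prev_id) -> list of allowed next IDs.
    def __init__(self, rules: dict, eos_token_id: int):
        self.rules = rules
        self.eos_token_id = eos_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        new_scores = torch.full_like(scores, float('-inf'))
        for row in range(input_ids.shape[0]):
            key = tuple(int(t) for t in input_ids[row, -2:])
            # unknown context: fall back to allowing only the end-of-sequence token
            legit_ids = self.rules.get(key, [self.eos_token_id])
            new_scores[row, legit_ids] = scores[row, legit_ids]
        return new_scores

The same structure extends naturally to JSON: the rules simply encode which tokens may follow an opening bracket, a key, or a comma.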

Be careful of tokenization

Tokenization is often overlooked, but correct tokenization is crucial when using generative AI for structured output, because under the hood tokenization shapes what your model has to learn. For example, you may fine-tune a model to translate text into JSON. As part of the fine-tuning process, you provide the model with text-JSON example pairs, which it tokenizes. What will this tokenization look like?

While you read "[[" as two square brackets, the tokenizer converts them into a single ID, which the token classifier treats as a completely distinct class from the single bracket. This makes the logic the model must learn more complicated (for example, remembering how many brackets to close). Similarly, adding a space before a word may change its tokenization and therefore its class ID.

For simpler learning, ensure that each concept and punctuation mark is consistently converted to the same token by adding spaces before words and characters.

Spaced-out words lead to more consistent tokenization (Image by author)
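
To check what your tokenizer actually does, it helps to print the splits for an unspaced and a spaced version of the same JSON (a quick check of my own, not code from the article; the exact splits depend on the vocabulary):

from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-large')

# Consecutive brackets often merge into a single token, while spaced-out
# brackets and words tend to stay as separate, consistent tokens.
print(tokenizer.tokenize('{"friends": [["Anna", "Sara"]]}'))
print(tokenizer.tokenize('{ " friends " : [ [ " Anna " , " Sara " ] ] }'))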

Inputting spaced examples during fine-tuning simplifies the patterns the model has to learn, enhancing model accuracy. During prediction, the model will output the JSON with spaces, which you can then remove before parsing.
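
For example, a minimal post-processing sketch (my own, assuming the model was trained to emit valid JSON with quoted keys) could parse the spaced output and strip the extra padding:

import json

def parse_spaced_output(text: str):
    # json.loads already ignores whitespace between tokens, so we only need
    # to strip the padding that ends up inside keys and string values.
    def clean(value):
        if isinstance(value, dict):
            return {k.strip(): clean(v) for k, v in value.items()}
        if isinstance(value, list):
            return [clean(v) for v in value]
        if isinstance(value, str):
            return value.strip()
        return value
    return clean(json.loads(text))

print(parse_spaced_output('{ "name" : " John Doe " , "friends" : [ " Anna " , " Sara " ] }'))
# {'name': 'John Doe', 'friends': ['Anna', 'Sara']}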

Summary

Generative AI offers a valuable approach for translating into a formatted language. By leveraging the knowledge of the output structure, we can constrain the generative process, eliminating a class of errors and ensuring the executability of queries and parse-ability of data structures.

Additionally, these formats may use punctuation and keywords to signify certain meanings. Making sure that the tokenization of these keywords is consistent can dramatically reduce the complexity of the patterns that the model has to learn.